Text Files

Data is often stored in formats that are not easy to read unless you have some specialised software. Excel spreadsheets are one example; how do you read an Excel spreadsheet if you don't have Excel? This reliance on particular software is not very useful for programmatic analysis of data. Instead, if we're going to use a programming language to analyse or process our data, we would prefer some "easy to deal with" format. Text files, or flat files, provide just the sort of format we need (there are other choices). You can think of these as a single sheet from a spreadsheet with the data arranged in rows (separated by our old friend the \n character) and columns (separated by some other character).

There are two common characters used to separate the data in text files into columns: the tab character (\t) and the humble comma (,). In both cases fields in our input file are separated by a single tab or a single comma.

This gives rise to two commonly used file extensions (the bit after the dot in a file name, e.g. in myfile.docx the 'docx' is the file extension). These file extensions are .tsv for 'tab-separated values' and .csv for 'comma-separated values'. Other separators in text files of data may be little used characters such as '|', ':' or spaces. As a quick aside, separating fields in data files by spaces is tricky because an individual value might also contain spaces and therefore be inappropriately divided over more than one column.

As an illustration if you were to print a couple of lines from a .tsv file out it would look something like this:

data_field1\tdata_field2\tdata_field3\n data_field1\tdata_field2\tdata_field3\n

Each field is separated by a tab character (\t) and the end of a line is indicated by the newline character (\n).

In a .csv file you would see:

data_field1,data_field2,data_field3\n data_field1,data_field2,data_field3\n
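As a minimal sketch of how these separators are used in practice, the snippet below splits one line of each kind into fields with the string method split(). The sample lines are made up for illustration. The last example shows why spaces make a poor separator: a column name that itself contains a space gets divided in two.

```python
# Illustrative lines: one tab-separated, one comma-separated.
tsv_line = 'F\t77\t63.8\t155.5\n'
csv_line = 'F,77,63.8,155.5\n'

# Remove the trailing newline, then split on the separator.
tsv_fields = tsv_line.rstrip('\n').split('\t')
csv_fields = csv_line.rstrip('\n').split(',')
print(tsv_fields)
print(csv_fields)

# A space-separated header: the 'body wt' column name is split in two,
# so we end up with five fields instead of four.
space_fields = 'Gender Age body wt ht'.split(' ')
print(space_fields)
```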

Note:

It's important to note that there is NO universally followed standard for either the layout OR naming of text files. In the exercises that follow the text file we will be working with has the extension .csv which should mean comma separated but in fact the data is separated by tabs - because that's what I'm in the habit of doing. Please ignore my bad habits!


Opening (and Reading) Files

When you open a file in a programme such as Word or Excel you have to select the file you want from a 'File -> Open' dialogue of some kind. In other words you have to point the programme at the specific file you want which is stored at some specific location on your computer. This is also true when you open a file in Python. The difference is that when you open a file in python you have to specify the file location in words (as a filepath) rather than selecting from a dialogue. So the first thing we have to do is assign the file location to a variable. This textual representation of the file location is called the filepath.

Once we have the filepath we can use the python function open() to 'get a handle' on the file. This file handle should also be assigned to a variable. We can then operate on that variable. Let's see an example.


In [1]:
file_loc = 'data/elderlyHeightWeight.csv'
f_hand = open(file_loc, 'r')
print(f_hand)


<_io.TextIOWrapper name='data/elderlyHeightWeight.csv' mode='r' encoding='UTF-8'>

In the example above we have opened a small file containing height and weight information on a group of elderly men and women from a body composition study.

In the first line we assign the file path to the variable file_loc. We then use that variable to open a file handle (f_hand) using the open() function. The file location is the first argument to the open() function and the second argument 'r' indicates that we want to read from the file i.e. we want access to the data in the file but we don't want to change the file itself. Finally we use a print statement to print the file handle.

The results of the print statement might surprise you. Rather than printing the contents of the file, what we get is a description of the file handle object itself: its type (_io.TextIOWrapper), the name of the file it points to, the mode it was opened in and the text encoding.
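The file handle is an ordinary Python object, so we can also ask it about itself. The sketch below is only an illustration: it first writes a small throwaway file (the name example.tsv is hypothetical) so that it runs anywhere, then inspects a few attributes of the handle.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
file_loc = os.path.join(tmp_dir, 'example.tsv')  # hypothetical file name
f_out = open(file_loc, 'w')
f_out.write('Gender\tAge\nF\t77\n')
f_out.close()

f_hand = open(file_loc, 'r')
print(f_hand.name)    # the path we passed to open()
print(f_hand.mode)    # the mode we asked for
print(f_hand.closed)  # False while the handle is still open
was_open = not f_hand.closed
f_hand.close()
print(f_hand.closed)  # True once close() has been called
```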

Whilst we can also open Excel or Word files in python this requires the use of special software libraries. We'll see some of those later in the course. Mostly when we are analysing data in files we open simple text files. Both Excel and Word can save files out as simple text files.

The first few lines of the file we will work with are shown below.


In [2]:
!head -n 4 data/elderlyHeightWeight.csv


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5

While the file handle does not contain the data from the file (it only points at it) it is easy to construct a for loop to cycle through the lines of the file and carry out some computation. For example we can easily count the number of lines in the file.


In [3]:
count = 0 # initialise

for line in f_hand: # iterate
    count = count + 1
    
print('There are %d lines in the file.' % count)


There are 19 lines in the file.

Why doesn't open() open the file directly?

It might seem stupid that after using open() a file handle is created but the file contents aren't directly available. The reason the open() function does not read the whole file immediately but just points to it is to do with file size. If you do not know in advance the size of the files you are dealing with (often the case - how often do you check the size of files you open on your computer?) automatically opening very large files could:

  • Take a long time
  • Crash the whole computer system - essentially you'd run out of memory

For some biological applications the data files (which may well be text files, e.g. RNA-Seq data) can be very large. So it's safer to point to the file rather than automatically read all of it. This is also true of other data or informatics applications.

In the for loop above python splits the file into lines based on the newline character (\n - the split is implicit), increments the count variable by 1 for every line and then discards that line. So there is only ever one line from the file in the computer memory at any given time.
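One consequence of reading line by line is that the file handle keeps track of how far through the file it has got. The sketch below (using a throwaway file with a hypothetical name) shows that a second loop over the same handle finds nothing left to read, which is why several cells in this notebook call open() again before re-reading the file. The seek(0) method is an alternative that rewinds the handle to the start.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
file_loc = os.path.join(tmp_dir, 'example.tsv')  # hypothetical file name
f_out = open(file_loc, 'w')
f_out.write('Gender\tAge\nF\t77\nM\t79\n')
f_out.close()

f_hand = open(file_loc, 'r')

first_pass = 0
for line in f_hand:   # reads the three lines one at a time
    first_pass += 1

second_pass = 0
for line in f_hand:   # the handle is already at the end of the file
    second_pass += 1

f_hand.seek(0)        # rewind to the start of the file
third_pass = sum(1 for line in f_hand)
f_hand.close()

print(first_pass, second_pass, third_pass)
```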

If you know that your file is likely to be small you can read the whole file into memory with the read() method (remember the dot notation!).

This reads the entire contents of the file, newlines and all, into one large string.


In [4]:
file_loc = 'data/elderlyHeightWeight.csv'
f_hand = open(file_loc, 'r')
f_data = f_hand.read()
print(f_data)


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5
F	77	58.5	151
F	82	64	165.5
F	78	51.6	167
F	85	54.6	154
F	83	71	153
M	79	75.5	171
M	75	83.9	178.5
M	79	75.7	167
M	84	72.5	171.5
M	76	56.2	167
M	80	73.4	168.5
M	75	67.7	174.5
M	75	93	168
M	78	95.6	168
M	80	75.6	183.5


In [5]:
f_hand.close()
f_hand = open(file_loc)
print(f_hand.readline(), end = '')
print(f_hand.readline(), end = '')
f_hand.close()
f_hand = open(file_loc)
lines = f_hand.readlines()
print(type(lines))


Gender	Age	body wt	ht
F	77	63.8	155.5
<class 'list'>

In [6]:
file_loc = 'data/elderlyHeightWeight.csv'
f_hand = open(file_loc, 'r')
f_data = f_hand.read()
f_data[:22]
print(len(f_data))
print( f_data[:10])
r'{}'.format(f_data[:22]) # note the tab stop in the output


286
Gender	Age
Out[6]:
'Gender\tAge\tbody wt\tht\n'

In the above example we first create the file handle and then read the entire contents of the file into one string. We check the length of that string (286 characters including whitespace characters like \n) and we print the first 10 characters (refer back to the material on slicing if you're unsure how the [:10] slice works). The print statement interprets the tab stop properly but if we just ask for the first 22 characters to be returned (i.e. we do not use print) we can see the tab stop and the \n. Compare this to the illustration of a .tsv file shown above.
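The difference between printing a string and displaying it as a cell result can be reproduced explicitly with the built-in repr() function. This is a small sketch on a hard-coded copy of the header line rather than on the file itself.

```python
header = 'Gender\tAge\tbody wt\tht\n'  # hard-coded copy of the header line

print(header)        # tab and newline are rendered as whitespace
shown = repr(header)
print(shown)         # escape sequences are shown literally, with quotes
print(len(header))   # \t and \n each count as a single character
```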

Using the print statement and subsetting is fine but not convenient. You might want to print the whole of the first line. The readline() method will read one line at a time and you can use this to e.g. just display the header line (if you know or suspect that your file has a header line).


In [7]:
f_hand = open(file_loc, 'r')
line = f_hand.readline() # reads first line
print(line)
# next line 
line=f_hand.readline()
print(line)


Gender	Age	body wt	ht

F	77	63.8	155.5

After reading in the current line readline() then moves on to the next line. So calling readline() again returns the next line in the file. One other thing to bear in mind is that readline() leaves whitespace, and in particular the \n character, at the end of the line. You can see that above (there's a blank line between the printed lines) and in the following example.


In [8]:
line = f_hand.readline() # note next line has been read
line # compare to print above


Out[8]:
'F\t80\t56.4\t160.5\n'

If you're using python to join lines together in some new format that might not be what you want. There is a string method, strip(), that removes whitespace from both ends of a string and can be used to remove this potentially extraneous \n character. Note also that methods can be chained together so you can use readline() and strip() sequentially using the following syntax.


In [10]:
f_hand = open(file_loc, 'r') # open the file again to get the header line
line = f_hand.readline().strip() # read the line then strip the whitespace at the end of the line
line # no \n!


Out[10]:
'Gender\tAge\tbody wt\tht'

Use of strip() has removed the \n from our selected line. There are also lstrip() and rstrip() methods that strip whitespace from only the left or right sides of a string respectively.
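The three methods can be compared side by side on a made-up line that has whitespace at both ends. The second half of this sketch illustrates a caveat worth knowing for tab-separated data: because a tab is also whitespace, strip() will swallow a trailing tab that marks an empty final field, whereas rstrip('\n') removes only the newline.

```python
raw = '  \tF\t77\t63.8\t155.5\n'  # made-up line with whitespace at both ends

both = raw.strip()    # whitespace removed from both ends
left = raw.lstrip()   # only the leading whitespace removed
right = raw.rstrip()  # only the trailing whitespace removed
print(repr(both))
print(repr(left))
print(repr(right))

# A line whose last field is empty (note the tab just before the newline).
short = 'M\t80\t\n'
lost_field = short.strip().split('\t')        # trailing tab is stripped too
kept_field = short.rstrip('\n').split('\t')   # only the newline is removed
print(lost_field)
print(kept_field)
```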

Just to confuse you further there's also a readlines() method that reads all the lines in the file into a list. Again the lines are separated on the invisible \n character. This can be handy because you can assign the list to a variable and then loop through the list to print file lines or simply extract the lines you want using slice notation.


In [11]:
f_hand = open(file_loc, 'r')
lines = f_hand.readlines()
print(lines[0:2]) # check we have a list
print(len(lines))


['Gender\tAge\tbody wt\tht\n', 'F\t77\t63.8\t155.5\n']
19

In [12]:
for i in range(4):
    print (lines[i].strip())


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5

Alternative Implementations (just for fun)

no. 1


In [13]:
for line in lines[:4]:
    print(line.strip())


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5

no. 2


In [14]:
for i, line in enumerate(lines):
    if i == 4:
        break
    print(line.strip())


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5

Just like readline(), the readlines() method leaves the trailing \n at the end of each line, but you can use strip() to remove it if you have to, as we did above. We had to use the strip() method on each individual line rather than on the list of lines because strip() is a string method and does not operate on lists. Try moving the strip() to the end of lines = f_hand.readlines() and see what kind of error you get.

Finally (and perhaps most usefully) there is the string method splitlines(). Applied to the string returned by read(), it gives the same list of lines as readlines() but drops the trailing \n automatically.


In [16]:
f_hand = open(file_loc, 'r')
lines = f_hand.read().splitlines() # read file, then split lines to lists, drops trailing \n
for i in range(4):
    print (lines[i])


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5

Notice we didn't have to use strip().
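To make the comparison concrete, here is a small sketch that runs both approaches on the same text. It uses io.StringIO as a stand-in for an open file, so it does not depend on the data directory; the three lines of text are made up.

```python
import io

text = 'Gender\tAge\nF\t77\nM\t79\n'

f_hand = io.StringIO(text)          # behaves like an open text file
with_newlines = f_hand.readlines()  # file method: keeps the trailing \n
without_newlines = text.splitlines()  # string method: drops the trailing \n

print(with_newlines)
print(without_newlines)
```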

One final thing to note is that whenever we finish with a file we should close it. Leaving files open after use ties up operating system resources (there is a limit on how many files a program can have open at once) and, for files opened for writing, data may not be fully saved to disk until the file is closed. Closing files is accomplished by using the close() method on the file handle. Also illustrated is a simple filter to print out only the male data using the string method startswith() - which returns a boolean value depending on whether the line begins with the given argument (M in this case) or not.


In [17]:
file_loc = 'data/elderlyHeightWeight.csv' # relative path
f_hand = open(file_loc, 'r')
lines = f_hand.read().splitlines() # lines to a list
print (lines[0]) # header

for line in lines: # loop to filter
    if line.startswith('M'):
        print (line)
    
f_hand.close()


Gender	Age	body wt	ht
M	79	75.5	171
M	75	83.9	178.5
M	79	75.7	167
M	84	72.5	171.5
M	76	56.2	167
M	80	73.4	168.5
M	75	67.7	174.5
M	75	93	168
M	78	95.6	168
M	80	75.6	183.5
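If something goes wrong between open() and close(), the close() call in the cell above is never reached. One way to guard against that, sketched here on a throwaway file with a hypothetical name, is to put the work in a try block and the close() call in a finally block, which runs whether or not an error occurred.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
file_loc = os.path.join(tmp_dir, 'example.tsv')  # hypothetical file name
f_out = open(file_loc, 'w')
f_out.write('Gender\tAge\nF\t77\nM\t79\n')
f_out.close()

f_hand = open(file_loc, 'r')
try:
    # keep only the lines for males, without the trailing newline
    male_lines = [line.rstrip('\n') for line in f_hand if line.startswith('M')]
finally:
    f_hand.close()  # runs even if the filtering above raised an error

print(male_lines)
print(f_hand.closed)
```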

filter alternative


In [18]:
def filter_function(line):
    return line.startswith('M')

In [19]:
f_hand = open(file_loc)
lines = f_hand.readlines()
male_gender = filter(filter_function, lines)
print(lines[0].strip())
for ml in male_gender:
    print(ml.strip())
f_hand.close()


Gender	Age	body wt	ht
M	79	75.5	171
M	75	83.9	178.5
M	79	75.7	167
M	84	72.5	171.5
M	76	56.2	167
M	80	73.4	168.5
M	75	67.7	174.5
M	75	93	168
M	78	95.6	168
M	80	75.6	183.5

Using lambda expressions


In [29]:
f_hand = open(file_loc)
male_gender = filter(lambda l: l.startswith('M'), f_hand)
for ml in male_gender:
    print(ml.strip())
f_hand.close()


M	79	75.5	171
M	75	83.9	178.5
M	79	75.7	167
M	84	72.5	171.5
M	76	56.2	167
M	80	73.4	168.5
M	75	67.7	174.5
M	75	93	168
M	78	95.6	168
M	80	75.6	183.5

Exercises

Show the content of the file using a Shell command

Tip 1: The shell command to be used could be cat

Tip 2: Remember the ! (exclamation mark)


In [22]:
!cat data/elderlyHeightWeight.csv


Gender	Age	body wt	ht
F	77	63.8	155.5
F	80	56.4	160.5
F	76	55.2	159.5
F	77	58.5	151
F	82	64	165.5
F	78	51.6	167
F	85	54.6	154
F	83	71	153
M	79	75.5	171
M	75	83.9	178.5
M	79	75.7	167
M	84	72.5	171.5
M	76	56.2	167
M	80	73.4	168.5
M	75	67.7	174.5
M	75	93	168
M	78	95.6	168
M	80	75.6	183.5

Ex. no 2

Print all the lines in the file where the Age value is in the range [70, 80)


In [23]:
f_hand = open(file_loc)
for i, line in enumerate(f_hand):
    if i == 0:
        continue
    line = line.strip()
    _, age, *_ = line.split('\t')
    if 70 <= int(age) < 80:
        print(line)


F	77	63.8	155.5
F	76	55.2	159.5
F	77	58.5	151
F	78	51.6	167
M	79	75.5	171
M	75	83.9	178.5
M	79	75.7	167
M	76	56.2	167
M	75	67.7	174.5
M	75	93	168
M	78	95.6	168

Ex. no 3

For each gender, print the line of the file with the maximum value of body weight (body wt) plus height (ht), i.e. two lines in total.

Sol #1: Using a Dictionary


In [24]:
info = {}  # Dictionary holding per-sex lines info
f_hand = open(file_loc)
lines = f_hand.read().splitlines()
for l in lines[1:]:
    l = l.strip()
    key = l[0]
    info.setdefault(key, [])
    info[key].append(tuple(l.split('\t')))

In [25]:
from pprint import pprint  # pprint is for **pretty printing** structures
pprint(info)


{'F': [('F', '77', '63.8', '155.5'),
       ('F', '80', '56.4', '160.5'),
       ('F', '76', '55.2', '159.5'),
       ('F', '77', '58.5', '151'),
       ('F', '82', '64', '165.5'),
       ('F', '78', '51.6', '167'),
       ('F', '85', '54.6', '154'),
       ('F', '83', '71', '153')],
 'M': [('M', '79', '75.5', '171'),
       ('M', '75', '83.9', '178.5'),
       ('M', '79', '75.7', '167'),
       ('M', '84', '72.5', '171.5'),
       ('M', '76', '56.2', '167'),
       ('M', '80', '73.4', '168.5'),
       ('M', '75', '67.7', '174.5'),
       ('M', '75', '93', '168'),
       ('M', '78', '95.6', '168'),
       ('M', '80', '75.6', '183.5')]}

In [26]:
max_male = max(info['M'], key=lambda e: float(e[2]) + float(e[3]))
print(max_male)


('M', '78', '95.6', '168')

Sol. #2: Using a list comprehension


In [27]:
## Creating Partial Lists using **List Comprehension**
males = [l.strip().split('\t') for l in lines[1:] 
             if l.startswith('M')]
females = [l.strip().split('\t') for l in lines[1:] 
               if l.startswith('F')]

In [28]:
males


Out[28]:
[['M', '79', '75.5', '171'],
 ['M', '75', '83.9', '178.5'],
 ['M', '79', '75.7', '167'],
 ['M', '84', '72.5', '171.5'],
 ['M', '76', '56.2', '167'],
 ['M', '80', '73.4', '168.5'],
 ['M', '75', '67.7', '174.5'],
 ['M', '75', '93', '168'],
 ['M', '78', '95.6', '168'],
 ['M', '80', '75.6', '183.5']]

In [30]:
max_male = max(males, key=lambda e: float(e[2]) + float(e[3]))
print(max_male)


['M', '78', '95.6', '168']

The csv module

Getting the data from a file and doing something with it is all well and good. However once we've done our analysis we usually want to save the results to another file. We can do this using base python but it's easier if we use a python library, in this case the csv library. We'll learn more about libraries in the next unit but for now just consider libraries as extra python code that you can get access to if you need it. In fact that's exactly what many libraries are. So the question arises 'how do we get access to a library?'. We have to tell python we want to use the library up front. To do this we use the import statement.


In [31]:
import csv

It's that simple! Now python makes available to us all the useful code in the csv library. The csv library, unsurprisingly, contains python functions and methods to make dealing with csv (and other) text files easier. Let's first see how to open a text file using the csv library and printing out the first few lines.

To read data from a csv file, we use the reader() function. The reader() function takes each line of the file and makes a reader object that produces a list for each row in the input data. Objects in programming are containers for both data and methods that act on that data (a bit esoteric so don't worry if you don't quite get that). One thing we can do with a reader object is pass it to the built-in next() function, which returns the next row each time it is called. Notably once we have processed a row it's gone from the reader object.
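The 'once it's processed it's gone' behaviour can be seen in a small sketch. Here io.StringIO stands in for an open file and the three lines of data are made up. Passing a second argument to next() is a way to get a default value back, instead of an error, once the reader has run out of rows.

```python
import csv
import io

# io.StringIO stands in for an open file; the data is made up.
data = io.StringIO('Gender\tAge\nF\t77\nM\t79\n')
reader = csv.reader(data, delimiter='\t')

header = next(reader)          # first row
first_row = next(reader)       # second row
remaining = list(reader)       # consumes whatever is left
leftover = next(reader, None)  # nothing left, so the default is returned

print(header)
print(first_row)
print(remaining)
print(leftover)
```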

Note:

From here on, we are going to use the with/as statement (a context manager) to handle I/O operations. It closes the file automatically when the indented block ends.

For more information, see this notebook.


In [36]:
# import csv - already done
with open('data/elderlyHeightWeight.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter
    header = next(reader)
    print (header)
    print () # blank line

    for i in range(4):
        print (next(reader)) # print the first 4 lines after the header


['Gender', 'Age', 'body wt', 'ht']

['F', '77', '63.8', '155.5']
['F', '80', '56.4', '160.5']
['F', '76', '55.2', '159.5']
['F', '77', '58.5', '151']

We can see that the reader() function has processed each line into a list, with one element per field, based on the field delimiter we supplied. Importantly also note that all the values in each list are of type str (everything is in quotes). This matters if you want to do calculations on these values, because they must be converted to numbers first.
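A short sketch of why the str type matters, using a hard-coded copy of the first data row shown above: adding two strings joins them together, so the values need converting with float() or int() before any arithmetic.

```python
row = ['F', '77', '63.8', '155.5']  # a row as produced by csv.reader

as_text = row[2] + row[3]  # string concatenation, not addition
print(as_text)

age = int(row[1])          # whole numbers can use int()
weight = float(row[2])     # decimal values need float()
height = float(row[3])
total = weight + height    # now this is real arithmetic
print(age, weight, height)
print(round(total, 1))
```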

Using the csv module makes it easy to select whole columns by selecting the data we want from the reader. We'll use the next() function to read the header row and find the column order, and then iterate over the rows with a for loop to pull out height and weight.


In [37]:
with open('data/elderlyHeightWeight.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter

    # use next() method on reader object to id the headers
    headers = next(reader)
    print(headers)
    
    # we now know weight index is 2, height index is 3
    
    weight  = ['Weight'] # list to hold data, put in header 
    height = ['Height']

    for row in reader:
        weight.append(row[2])
        height.append(row[3])
    
    print (weight)
    print (height)


['Gender', 'Age', 'body wt', 'ht']
['Weight', '63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75.5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
['Height', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

The iterable in the for loop above is the reader object, which yields one row of the input file at a time. From each row we simply capture the two values we want and add these to lists. We could then further process the data in these two lists.
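As one possible example of such further processing, the sketch below calculates the mean of each column. To keep it self-contained the lists are hard-coded with just the first three values from the output above, so the means are for those three people only, not for the whole file.

```python
# First three values from each list above, header included.
weight = ['Weight', '63.8', '56.4', '55.2']
height = ['Height', '155.5', '160.5', '159.5']

# Skip the header with a slice and convert the strings to floats.
weight_values = [float(w) for w in weight[1:]]
height_values = [float(h) for h in height[1:]]

mean_weight = sum(weight_values) / len(weight_values)
mean_height = sum(height_values) / len(height_values)

print('Mean weight: %.2f kg' % mean_weight)
print('Mean height: %.2f cm' % mean_height)
```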

Writing files

In order to open a file for writing we use the 'w' parameter in our open() statement. Rather obviously 'w' stands for write. If the file doesn't exist a new file is created with the given name and extension.

Note that if the file exists then opening it with the 'w' argument removes any data that was in the file and overwrites it with what you put in. This may not be what you wanted to do. We'll cover how you append data to a file without overwriting the contents shortly.
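The overwriting behaviour is easy to demonstrate safely. This sketch writes to a throwaway file in a temporary directory (not to your data directory), opens it a second time with 'w', and then reads back what is left.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
file_loc = os.path.join(tmp_dir, 'test.txt')  # throwaway file

f_out = open(file_loc, 'w')
f_out.write('first version\n')
f_out.close()

f_out = open(file_loc, 'w')   # opening with 'w' again empties the file
f_out.write('second version\n')
f_out.close()

f_hand = open(file_loc, 'r')
contents = f_hand.read()
f_hand.close()
print(repr(contents))         # only the second write survives
```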

Once we have an open file we can write data to it with the write() method applied to the file handle.

Let's open a file and write some data to it.


In [40]:
with open('data/test.txt', 'w') as f_out:
    for i in range(10):
        line  = 'Line ' + str(i) + '\n'
        f_out.write(line)

If you run the above code a new file should appear in your data directory (notice we opened the writeable file in the /data directory) called test.txt. That file should have 10 lines in it with the word 'Line' and a number from 0-9.

In the above code we first opened (created) the file test.txt and then ran through a range of numbers (from 0 to 9) using a for loop. At each iteration of the loop we concatenated (joined) the word 'Line' to the string representation of the number (note the use of str) and a newline character. Finally we wrote each of the resulting strings to our new file. Because we opened the file in a with block, it was closed automatically when the block ended.

Putting it together!

Write a script that uses the csv module to open a file after getting a filepath from the user. Use the script to open the elderlyHeightWeight.csv file. Write out a new file containing only male data. Remember to close all the files once you're done. In addition include a try/except clause to handle the situation where the requested file doesn't exist.

Hint: each row produced by csv.reader is a list. Recall how you .join() list elements into a string.

Adding data to an existing file

As noted above if you open an existing file and write data to it all that pre-existing data gets overwritten. That's not usually what you want to do. In fact in general you probably never want to write to any file that has raw data you are going to analyse in it - because you might lose or screw-up your original data. Sometimes however you might want to add new measurements (perhaps taken over time) to an existing file. For these cases there's the 'a' argument to the open() function. The a stands for append. Let's take the file containing only the male data we wrote in the last exercise, open it in append mode and write the female data to that file.


In [7]:
import csv

# assumes your file was called male_data.tsv
try:
    with open('data/male_data.tsv', 'a') as new_file, open('data/elderlyHeightWeight.csv', 'r') as f_hand: 
        reader = csv.reader(f_hand, delimiter='\t') # define the field delimiter
        female_data = [line for line in reader if line[0] == 'F'] # keep only the female rows
        for line in female_data:
            new_file.write('\t'.join(line)+'\n')
except FileNotFoundError:
    print('The file does not exist.')
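For comparison with 'w', here is a minimal sketch of append mode on a throwaway file in a temporary directory. It also shows that opening a file that does not yet exist with 'a' simply creates it, so in the cell above a FileNotFoundError can only come from the file being read.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
file_loc = os.path.join(tmp_dir, 'test.txt')  # throwaway file

existed_before = os.path.exists(file_loc)     # the file is not there yet

f_out = open(file_loc, 'a')   # 'a' creates the file if it is missing
f_out.write('first line\n')
f_out.close()

f_out = open(file_loc, 'a')   # opening with 'a' again keeps what is there
f_out.write('second line\n')
f_out.close()

f_hand = open(file_loc, 'r')
contents = f_hand.read()
f_hand.close()
print(repr(contents))         # both lines are present
```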

Processing and writing file data

Let's do something a bit more useful than just copying data around from one file to another. Often when we have demographic data like this one of the things we want to do is create new variables from that data. The elderlyHeightWeight.csv file contains... eh, well... height and weight data from a sample of elderly study participants. One obvious new variable we could create from this is BMI. However we'll save that for the exercise!

Instead we'll demonstrate the process by converting the height from cm to m - a simple division by 100. We can write this data to a new column. The strategy we'll use is to read each field of the data into a separate list. We will process the appropriate list and then use the writer() function of the csv module to write our new file including processed height data.

We'll use a slightly different approach here from that demonstrated above (previous height & weight example). Instead of picking values out of each row by position, we'll use csv.DictReader, which uses the header row to label every value with its column name.


In [15]:
# import csv  - done above

with open('data/elderlyHeightWeight.csv', 'r') as f_hand:
    csv_info = dict()
    reader = csv.DictReader(f_hand, delimiter='\t') # define the field delimiter
    for entry in reader:
        for key, value in entry.items():
            if key not in csv_info:
                csv_info[key] = []  # initialise as an empty list
            csv_info[key].append(value)

for key, value in csv_info.items():
    print('{}: \n\t {}'.format(key, value))

# one list per column, header first, for use in the cells that follow
gender = ['Gender'] + csv_info['Gender']
age = ['Age'] + csv_info['Age']
weight = ['body wt'] + csv_info['body wt']
height = ['ht'] + csv_info['ht']


Gender: 
	 ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M']
Age: 
	 ['77', '80', '76', '77', '82', '78', '85', '83', '79', '75', '79', '84', '76', '80', '75', '75', '78', '80']
body wt: 
	 ['63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75.5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
ht: 
	 ['155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

In this code snippet csv.DictReader used the header row as dictionary keys, so each row arrived as a mapping from column name to value. We collected the values column by column into the csv_info dictionary, one list per column.

The cells that follow work with four separate lists, gender, age, weight and height, one per column. If you examine these lists you'll see that the first entry is the column header (which is handy for tracking data) and the other entries are the actual data for that column in the original file.


In [43]:
print (height)


['ht', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

Now we have the data separated out it's a trivial effort to calculate the height in meters (from the given height in cm). In the code below we use the slice height[1:] to get just the actual heights (i.e. we skip the column header), we convert those heights from str to float and we calculate the height in meters and append this to a new list.


In [44]:
height_m = []
height_m.append('ht_m') # a new header

# slice from index 1 so we don't get the header again
for ht in height[1:]:
    height_m.append(float(ht)/100) # note the conversion to a float here

print (height_m)


['ht_m', 1.555, 1.605, 1.595, 1.51, 1.655, 1.67, 1.54, 1.53, 1.71, 1.785, 1.67, 1.715, 1.67, 1.685, 1.745, 1.68, 1.68, 1.835]

Now we have all the data we need to write the new file. We'll use the zip() function to combine our lists into rows, one tuple per line of the new file, and then write each row to the file. Rather than calling .write() on the file handle ourselves, we wrap the file handle in a csv writer object. One method of writer objects is .writerow(), the use of which is demonstrated below.


In [45]:
with open('data/new_data.csv', 'w', newline='') as newdata_file: # newline='' is recommended when writing with the csv module
    writer = csv.writer(newdata_file, delimiter='\t') # define a writer object
    
    # iterate over data and write to file
    # use zip to create list of tuples for writing
    for row in zip(gender, age, weight, height, height_m):
        writer.writerow(row)
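To see exactly what .writerow() produces, the sketch below writes two made-up rows to an in-memory buffer (io.StringIO) instead of a file and then inspects the resulting text. Note that numbers such as the float in the second row are converted to text for us.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for an output file
writer = csv.writer(buffer, delimiter='\t')

writer.writerow(['Gender', 'ht', 'ht_m'])  # header row
writer.writerow(['F', '155.5', 1.555])     # strings and a float

written = buffer.getvalue()
print(repr(written))
```

By default the writer ends each row with \r\n rather than \n. This is the reason the csv documentation recommends passing newline='' to open() when writing csv files, so that the line endings are not translated a second time.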

Remember that the zip() function will create an iterator (i.e. zip object) made up of tuples. In the example above the use of zip() creates a sequence the first element of which is all the first elements of our data lists, the second list element is all the second elements etc. It's easier to see this than explain it.


In [47]:
zip_sequence = zip(gender, age, weight, height, height_m)
print(type(zip_sequence))


<class 'zip'>

In [48]:
print (gender[:4])
print (age[:4])
print (weight[:4])
print (height[:4])
print (height_m[:4])
print (list(zip_sequence)[:4])


['Gender', 'F', 'F', 'F']
['Age', '77', '80', '76']
['body wt', '63.8', '56.4', '55.2']
['ht', '155.5', '160.5', '159.5']
['ht_m', 1.555, 1.605, 1.595]
[('Gender', 'Age', 'body wt', 'ht', 'ht_m'), ('F', '77', '63.8', '155.5', 1.555), ('F', '80', '56.4', '160.5', 1.605), ('F', '76', '55.2', '159.5', 1.595)]

The first element in each of our data lists is the column header. The zip() function captures these first elements into a tuple - ('Gender', 'Age', 'body wt', 'ht', 'ht_m') - and this, in turn, becomes the first item produced by the zip object. The zip() function then captures all the second elements from each data list and these become the second tuple, and so on. In this way each data list is 'zipped up' with the other lists.

To output the rows we simply iterate over the zip object and send each tuple to our output file as a row using the .writerow() method.
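One thing to watch for when zipping columns together is that zip() stops as soon as the shortest list runs out, without raising an error. The sketch below uses made-up lists, one of which is deliberately a value short, and shows a simple length check that could be run before writing anything.

```python
gender = ['Gender', 'F', 'F']
age = ['Age', '77', '80']
height_m = ['ht_m', 1.555]  # one value short on purpose

rows = list(zip(gender, age, height_m))
print(rows)  # the third row is silently dropped

# A simple check: all the lists should have the same length.
lengths = {len(gender), len(age), len(height_m)}
same_length = len(lengths) == 1
print(same_length)
```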

Putting it together 1

Open the elderlyHeightWeight.csv using the functions in the csv module and extract each column to a separate list. Use the height and weight data to calculate the BMI for each subject. Use zip() to create a list of data to write out and write all the phenotype data including BMI back to a new file.

Hint - if you use the csv.reader() remember the issues with the str type in lists.

Putting it together 2

Read the file you just created back in and select only those trial participants who are obese. Print the sex, age and BMI of these people. Obese means a BMI of 30 or more.

Homework

The nhanes.tsv file in the data directory contains data on 4581 Americans aged from 20 to 70 from the 2011-2012 NHANES survey. The data included are

  • individual number (unique ID for each individual in NHANES)
  • age (years)
  • sex (1 = M, 2 = F)
  • weight (kg)
  • height (cm).

Write a script that will read this data and count the number of NA values in height and/or weight and count the number of males and females.

Calculate the BMI for each individual, add this to the original data and write out a new file including BMI data.

Finally calculate the mean BMI for males and females and write these out as well (to 2 decimal places).

Hint: In this exercise you should use the techniques you have learned to loop over the lines of a file and extract each variable into its own list. You can then calculate the BMI values easily. However you won't be able to calculate a BMI for individuals with 'NA' in either weight or height columns. How can you use the continue keyword when you loop over your data to avoid collecting values for these individuals?